Shortest distance between strings - Levenstein algorithm¶
Top horizontal row is always 1,2,... -- cost of insertions [j]
Left vertical column is always 1,2,... -- cost of deletions [i]
begin at upper left (<= 0)
to fill in a cell: diag above
left min (above + delete, diag + replace, left + insert)
See: http://www.let.rug.nl/kleiweg/lev/
https://people.cs.umass.edu/~mccallum/courses/cl2006/lect4-stredit.pdf
Given strings S and T.
Distance is shortest sequence of edit commands that transform S to T,
(or equivalently T to S).
Simple set of operations:
Copy character from S over to T: cost = 0
Delete a character in S: cost = 1
Insert a character in T: cost = 1
Substitute one character for another: cost = 1
There are lots of applications of Levenshtein distance. It is used in biology
to find similar sequences of nucleic acids in DNA or amino acids in proteins.
It is used in some spell checkers to guess at which word (from a dictionary)
is meant when an unknown word is encountered.
Wilbert Heeringa's dialectology project uses Levenshtein distance
to estimate the proximity of dialect pronunciations.
And some translation assistance projects have used the alignment
capability of the algorithm in order to discover (the approximate location of)
good translation equivalents.
Initial F:
F = [ [i + j if i * j == 0 else 0 for j in range(len(B) + 1)] for i in range(len(A) + 1)]
+-----------------------> j
| b e n i n
| [0, 1, 2, 3, 4, 5],
| b [1, 0, 0, 0, 0, 0],
| i [2, 0, 0, 0, 0, 0],
| g [3, 0, 0, 0, 0, 0],
| b [4, 0, 0, 0, 0, 0],
| e [5, 0, 0, 0, 0, 0],
| n [6, 0, 0, 0, 0, 0]
v
i
def levenstein(A, B):
''' Editor distance
indel =1 - the cost for inserting or deleting a symbol
substitution=1 - the cost for substituting one symbol for another
swap - the cost for swapping two adjacent symbols.
'''
F = [ [i + j if i * j == 0 else 0 for j in range(len(B) + 1)] for i in range(len(A) + 1)]
for i in range(1, len(A) + 1): # +1 - "zero" elements in line
for j in range(1, len(B) + 1): # +1 - "zero" elements in column
if A[i-1] == B[j-1]: # -1 - real indexes
F[i][j] = F[i-1][j-1] # == previous diag distance
else:
# Adding 1 because of error
# and min (shortest distance) distance of possible previous iteratins
F[i][j] = 1 + min(F[i][j-1], F[i-1][j], F[i-1][j-1])
return F[len(A)][len(B)]
Test¶
A = list("bigben")
B = list("bigben")
[[0, 1, 2, 3, 4, 5, 6],
+-----------------> j
[1, 0, 1, 2, 3, 4, 5],
[2, 1, 0, 1, 2, 3, 4],
[3, 2, 1, 0, 1, 2, 3],
[4, 3, 2, 1, 0, 1, 2],
[5, 4, 3, 2, 1, 0, 1],
[6, 5, 4, 3, 2, 1, 0]]
min_dist = levenstein(A, B)
print("Minimal distance = {}".format(min_dist)) # Minimal distance = 0
A = list("b2gben")
B = list("b1gben")
[[0, 1, 2, 3, 4, 5, 6],
+-----------------> j
[1, 0, 1, 2, 3, 4, 5],
[2, 1, 1, 2, 3, 4, 5], # 2, 2
[3, 2, 2, 1, 2, 3, 4],
[4, 3, 3, 2, 1, 2, 3],
[5, 4, 4, 3, 2, 1, 2],
[6, 5, 5, 4, 3, 2, 1]]
min_dist = levenstein(A, B)
print("Minimal distance = {}".format(min_dist)) # Minimal distance = 1 (wrong character)
A = list("bigben")
B = list("bgben")
[[0, 1, 2, 3, 4, 5],
+-----------------> j
[1, 0, 1, 2, 3, 4],
[2, 1, 1, 2, 3, 4],
[3, 2, 1, 2, 3, 4],
[4, 3, 2, 1, 2, 3],
[5, 4, 3, 2, 1, 2],
[6, 5, 4, 3, 2, 1]]
min_dist = levenstein(A, B)
print("Minimal distance = {}".format(min_dist)) # Minimal distance = 1 (missed character)
A = list("bigben")
B = list("benin")
b i g b e n
b e n i n
b e n i n
[ [0, 1, 2, 3, 4, 5],
+-----------------> j
b [1, | 0, 1, 2, 3, 4], b = b => [0, 1, 2, 3, 4]
i [2, | 1, 1, 2, 2, 3], e != i => 1 + 0
g [3, | 2, 2, 2, 3, 3],
b [4, | 3, 3, 3, 3, 4],
e [5, | 4, 3, 4, 4, 4],
n [6, | 5, 4, 3, 4, 4]]
min_dist = levenstein(A, B)
print("Minimal distance = {}".format(min_dist)) # Minimal distance = 4
# TBD
# Recover the path from F(N, M) to F(0, 0)